A large English–Thai parallel corpus from the web and machine-generated text


Abstract

The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents. The methodology for gathering data, building parallel texts and removing noisy sentence pairs is presented in a reproducible manner. We train machine translation models based on this dataset. Our models perform comparably to the Google Translation API (as of May 2020) for Thai-English, and outperform it for both Thai-English and English-Thai translation when the Open Parallel Corpus (OPUS) is included in the training data. The dataset, pre-trained models, and source code to reproduce our work are available for public use.


Similar resources

Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text

We describe a set of techniques that have been developed while collecting parallel texts for the Russian-English language pair and building a corpus of parallel sentences for training a statistical machine translation system. We discuss issues of verifying potential parallel texts and filtering out automatically translated documents. Finally we evaluate the quality of the 1-million-sentence corpus w...


News Image Annotation on a Large Parallel Text-image Corpus

In this paper, we present a multimodal parallel text-image corpus, and propose an image annotation method that exploits the textual information associated with images. Our corpus contains news articles composed of text, images and image captions, and is significantly larger than the other news corpora proposed in image annotation papers (27,041 articles and 42,568 captioned images). In our e...


A Large Spanish-Catalan Parallel Corpus Release for Machine Translation

We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The produced corpus of 7.5 M parallel sentences (around 180 M words per language) is useful for many natural language applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catala...


Evaluating DOGMA-lexons Generated Automatically from a Text Corpus

Our purpose was to devise a method to evaluate the results of extracting semantic relations from text corpora in an unsupervised way. We have worked on a legal corpus (EU VAT directive) consisting of 43K words. Using a shallow parser, a set of "lexons" has been produced. These are to be used as preprocessed material for the construction of ontologies from scratch. A knowledge engineer has judge...


Building Large Scale Text Corpus for Tibetan Natural Language Processing by Extracting Text from Web Pages

In this paper, we propose an approach to build a large-scale text corpus for Tibetan natural language processing. We find the distribution of Tibetan web pages on the internet with a crawler which can identify whether or not a web page contains Tibetan text. The three biggest websites are selected, and topic pages are selected with a rule-based method by checking the URL. The layout structures of ...



Journal

Journal title: Language Resources and Evaluation

Year: 2021

ISSN: 1574-020X, 1574-0218

DOI: https://doi.org/10.1007/s10579-021-09536-6